Dimensionality reduction for zero-inflated single cell gene expression analysis
Abstract
Single cell RNA-seq data allows insight into normal cellular function and diseases including cancer through the molecular characterisation of cellular state at the single-cell level. Dimensionality reduction of such high-dimensional datasets is essential for visualization and analysis, but single-cell RNA-seq data is challenging for classical dimensionality reduction methods because of the prevalence of dropout events, which lead to zero-inflated data. Here we develop a dimensionality reduction method, (Z)ero (I)nflated (F)actor (A)nalysis (ZIFA), which explicitly models the dropout characteristics, and show that it improves performance on simulated and biological datasets.
Text: Single cell RNA expression analysis (scRNA-seq) is revolutionizing whole-organism science, allowing the unbiased identification of previously uncharacterized molecular heterogeneity at the cellular level. Statistical analysis of single cell gene expression profiles can highlight putative cellular subtypes, delineating subgroups of T-cells, lung cells and myoblasts. These subgroups can be clinically relevant: for example, individual brain tumors contain cells from multiple types of brain cancers, and greater tumor heterogeneity is associated with worse prognosis.
Despite the success of early single cell studies, the statistical tools that have been applied to date are largely generic and rarely take into account the particular structural features of single cell expression data. In particular, single cell gene expression data contains an abundance of dropout events that lead to zero expression measurements. These dropout events may be the result of technical sampling effects (due to low transcript numbers) or of real biology arising from stochastic transcriptional activity (Fig. 1a). Here, we show that the performance of standard dimensionality-reduction algorithms on high-dimensional single cell expression data can be perturbed by the presence of zero-inflation, making them sub-optimal.
We present a new dimensionality-reduction model, Zero-Inflated Factor Analysis (ZIFA), that explicitly accounts for the presence of dropouts, and demonstrate that ZIFA outperforms other methods on simulated data and single cell data from recent scRNA-seq studies. The fundamental empirical observation that underlies the zero-inflation model in ZIFA is that the dropout rate for a gene depends on the expected expression level of that gene in the population. Genes with lower expression magnitude are more likely to be affected by dropout than genes that are expressed with greater magnitude. In particular, if the mean level of non-zero expression is given by μ and the dropout rate for that gene by p0, we have found that this dropout relationship can be approximately modelled with a parametric form p0 = exp(-λμ²), where λ is a fitted parameter, based on a double exponential function. This relationship is consistent with previous investigations and holds in many existing single cell datasets (Fig. 1b). The use of this parametric form permits fast, tractable linear algebra computations in ZIFA, enabling its use on realistically sized datasets in a multivariate setting. ZIFA adopts a latent variable model based on the Factor Analysis (FA) framework and augments it with an additional zero-inflation modulation layer. Like FA, the data generation process assumes that ...
bioRxiv preprint first posted online May 8, 2015; doi: http://dx.doi.org/10.1101/019141. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
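The dropout relationship described above can be illustrated with a short sketch. It assumes the decay acts on the squared mean, p0 = exp(-λμ²), as the "double exponential" description suggests. The function names, the least-squares fit through the origin, and the example values are all hypothetical choices made for illustration; this is not the fitting procedure used in ZIFA itself, which the text here does not describe.

```python
import math


def dropout_rate(mu, lam):
    """Dropout probability p0 = exp(-lam * mu**2) for mean non-zero expression mu."""
    return math.exp(-lam * mu ** 2)


def fit_decay(mus, p0s, eps=1e-12):
    """Estimate lam from per-gene (mu, p0) pairs by least squares.

    Taking logs gives -log(p0) = lam * mu**2, a line through the origin
    in x = mu**2, so the least-squares slope is sum(x * y) / sum(x * x).
    This estimator is an illustrative assumption, not the authors' method.
    """
    xs = [m ** 2 for m in mus]
    # Clamp p0 into [eps, 1] so the logarithm stays finite.
    ys = [-math.log(min(max(p, eps), 1.0)) for p in p0s]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical per-gene summaries generated from a known decay parameter.
true_lam = 0.1
mus = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
p0s = [dropout_rate(m, true_lam) for m in mus]

lam_hat = fit_decay(mus, p0s)
print(f"estimated lambda: {lam_hat:.4f}")
```

On real data, μ and p0 would be computed per gene from the expression matrix, and a noise-aware fit would likely be preferable to this noiseless example.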
Related resources
Bioconductor workflow for single-cell RNA sequencing: Normalization, dimensionality reduction, clustering, and lineage inference
Novel single-cell transcriptome sequencing assays allow researchers to measure gene expression levels at the resolution of single cells and offer the unprecedented opportunity to investigate at the molecular level fundamental biological questions, such as stem cell differentiation or the discovery and characterization of rare cell types. However, such assays raise challenging statistical and c...
Hurdle, Inflated Poisson and Inflated Negative Binomial Regression Models for Analysis of Count Data with Extra Zeros
In this paper, we propose Hurdle regression models for analysing count responses with extra zeros. Maximum likelihood estimation is used to estimate the model parameters. The application of the proposed model is presented on an insurance dataset. In this example, many of the claim counts are equal to zero, which illustrates the application of the model with a zero-inflat...
A Novel Dimensionality Reduction Technique Based on Independent Component Analysis for Modeling Microarray Gene Expression Data
DNA microarray experiments, which generate thousands of gene expression measurements, are being used to gather information from tissue and cell samples regarding gene expression differences that will be useful in diagnosing disease. But one challenge of microarray studies is the fact that the number n of samples collected is relatively small compared to the number p of genes per sample which are usu...
Data exploration, quality control and testing in single-cell qPCR-based gene expression experiments
MOTIVATION: Cell populations are never truly homogeneous; individual cells exist in biochemical states that define functional differences between them. New technology based on microfluidic arrays combined with multiplexed quantitative polymerase chain reactions now enables high-throughput single-cell gene expression measurement, allowing assessment of cellular heterogeneity. However, few analyti...
Effects of combined 5-Fluorouracil and ZnO NPs on human breast cancer MCF-7 Cells: P53 gene expression, Bcl-2 signaling pathway, and invasion activity
Objective(s): The significant contribution of nanoparticles to cancer treatment has attracted therapeutic attention. The present study aimed to evaluate the synergistic effects of 5-fluorouracil (5-FU) and zinc oxide nanoparticles (ZnO NPs) as multimodal drug delivery on human breast cancer MCF-7 cells. Materials and Methods: In this in-vitro study, the impact of 5-FU and ZnO NPs in the sin...